ABSTRACT Analysis of cellular protein patterns by
computer-aided 2-dimensional gel electrophoresis together
with recent advances in protein sequence analysis has
made possible the establishment of comprehensive
2-dimensional gel protein databases that may link protein
and DNA information and that offer a global approach to
the study of the cell. Using the integrated approach
offered by 2-dimensional gel protein databases it is now
possible to reveal phenotype-specific protein (or
proteins), to microsequence them, to search for homology
with previously identified proteins, to clone the cDNAs,
to assign partial protein sequence to genes for which the
full DNA sequence and the chromosome location are known,
and to study the regulatory properties and function of
groups of proteins that are coordinately expressed in a
given biological process. Human 2-dimensional gel protein
databases are becoming increasingly important in view of
the concerted effort to map and sequence the entire
genome. Celis, J. E.; Rasmussen, H. H.; Leffers, H.;
Madsen, P.; Honore, B.; Gesser, B.; Dejgaard, K.;
Vandekerckhove, J. Human cellular protein patterns and
their link to genome DNA sequence data: usefulness of
two-dimensional gel electrophoresis and microsequencing.

Key Words: human protein patterns • 2-dimensional gel protein
databases • gene expression • microsequencing • cDNA cloning
• linking protein and DNA information • genome mapping and
sequencing



Proteins synthesized from information contained in the
DNA orchestrate most cellular functions. The total number
of proteins synthesized by a typical human cell is unknown,
although current estimates range from 3000 to 6000. Of
these, as many as 70% may perform household functions
and are expected to be shared by all cell types irrespective
of their origin. There are many different cell types in the
human body, with perhaps 30,000 to 50,000 proteins expressed
in the organism as a whole, judged from the fact that about
1[...] of the haploid genome corresponds to genes. Today only
a small fraction of the total set of proteins has been
identified, and little is known about the protein patterns of
individual cell types or their variation under physiological
and abnormal conditions.

For the past 15 years, high resolution 2-dimensional gel
electrophoresis has been the technique of choice to determine
the protein composition of a given cell type and for
monitoring changes in gene activity through quantitative
and qualitative analysis of the thousands of proteins that
orchestrate various cellular functions (refs 1-6 and
references therein). The technique, originally described by
O'Farrell (1), separates proteins in terms of their
isoelectric point (pI) and molecular weight. Usually one
chooses a condition of interest and the cell reveals the
global protein behavioral response, as all detected proteins
can be analyzed both qualitatively and quantitatively in
relation to each other. At present, most available
2-dimensional gel techniques (regular gel format) can resolve
between 1000 and 2000 proteins from a given mammalian cell
type, a number that corresponds to about 2 million base pairs
of coded DNA. Less abundant proteins can be detected by
analyzing partially purified cellular fractions.

Two-dimensional gel electrophoresis has been widely applied
to analysis of cellular protein patterns from bacteria to
mammalian cells (refs 1-6 and references therein). In spite
of much work, however, information gathered from these
studies has not reached the scientific community in its
fullness because of the lack of standardized gel systems and
the lack of means for storing and communicating protein
information. Only recently, because of the development of
appropriate computer software (7-13), has it been possible to
scan gels, assign numbers to individual proteins, and store
the wealth of information in quantitative and qualitative
comprehensive 2-dimensional gel protein databases (4, 14-23),
i.e., those containing information about the various
properties (physical, chemical, biological, biochemical,
physiological, genetic, immunological, architectural, etc.)
of all the proteins that can be detected in a given cell
type. Such integrated 2-dimensional gel protein databases
offer an easy and standardized medium in which to store and
communicate protein information and provide a unique
framework in which to focus a multidisciplinary approach to
the study of the cell. Once a protein is identified in the
database, all of the information accumulated can be easily
retrieved and made available to the researcher. In the long
run, protein databases are expected to foster a wide variety
of biological information that may be instrumental to
researchers working in many areas of biology, among others
cancer and oncogene studies, differentiation, development,
drug development and testing, genetic variation, and
diagnosis of genetic and clinical diseases (Fig. 1).

The approach using systematic 2-dimensional gel protein 
analysis has recently gained a new dimension with the ad- 
vent of techniques to microsequence major proteins recorded 

Figure 1. Interface between partial protein sequence databases,
comprehensive 2-dimensional gel databases, and the human
genome sequencing project. Appropriate software is required to
compare protein and DNA sequences. In general, although the
inference of a protein's sequence from the DNA sequence (thick
arrow) is direct and unambiguous, the DNA sequence can only be
inferred approximately from the protein sequence (thin arrow),
and cloning of the gene requires either a cDNA or the requisite
group of oligonucleotide probes deduced from the partial amino
acid sequence. Modified from ref 6.



in the databases (refs 24-42 and references therein). Partial
protein sequences can be used to search for protein identity
as well as to prepare specific DNA probes for cloning
as-yet-uncharacterized proteins (Fig. 1). As these sequences
can be stored in the database (see for example Fig. 2H), they
offer a unique opportunity to link information on proteins
with the existing or forthcoming DNA sequence data on the
human genome (Fig. 1) (20, 36, 39).
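
The asymmetry noted in the Fig. 1 legend (DNA to protein is unambiguous, protein to DNA is not) can be made concrete with a small calculation. The sketch below is only an illustration and is not part of the procedures described in this review: it counts how many distinct DNA sequences could encode a given peptide under the standard genetic code, and picks the stretch of a partial sequence that would need the fewest oligonucleotide probes. The function names are hypothetical.

```python
# Number of codons per amino acid in the standard genetic code (61 sense codons).
CODONS_PER_RESIDUE = {
    "A": 4, "R": 6, "N": 2, "D": 2, "C": 2, "Q": 2, "E": 2, "G": 4,
    "H": 2, "I": 3, "L": 6, "K": 2, "M": 1, "F": 2, "P": 4, "S": 6,
    "T": 4, "W": 1, "Y": 2, "V": 4,
}


def probe_degeneracy(peptide):
    """Return how many distinct DNA sequences could encode `peptide`."""
    total = 1
    for residue in peptide:
        total *= CODONS_PER_RESIDUE[residue]
    return total


def least_degenerate_window(peptide, length):
    """Return (start, window, degeneracy) for the first window of `length`
    residues that needs the fewest probes; `start` is 0-based."""
    best = None
    for start in range(len(peptide) - length + 1):
        window = peptide[start:start + length]
        score = probe_degeneracy(window)
        if best is None or score < best[2]:
            best = (start, window, score)
    return best


if __name__ == "__main__":
    print(probe_degeneracy("GTVTDFPGFDER"))
    print(least_degenerate_window("GTVTDFPGFDER", 6))
```

Stretches rich in Met and Trp (one codon each) give the smallest probe mixtures, while Leu, Ser, and Arg (six codons each) inflate the count quickly.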

Using the integrated approach offered by comprehensive
2-dimensional gel databases (Fig. 1), it will be possible to
identify phenotype-specific proteins; microsequence them
and store the information in the database; search for
homology with previously characterized proteins; clone the
cDNAs; assign partial protein sequences to genes for which
the full DNA sequence and the chromosome location are
known; and study the regulatory properties and function of
groups of proteins (pathways, organelles, etc.) that are
coordinately expressed in a given biological process.
Comprehensive 2-dimensional gel protein databases will depict
an integrated picture of the expression levels and properties
of the thousands of protein components of organelles,
pathways, and cytoskeletal systems in both physiological and
abnormal conditions and are expected to lead to
identification of new regulatory networks in different cell
types and organisms. In the future, 2-dimensional gel protein
databases may be linked to each other as well as to national
and international specialized databanks on nucleic acid and
protein sequences, protein structures, NMR experimental data,
complex carbohydrates, etc.

A few 2-dimensional gel protein databases that are accessible
in computer form have been published in extenso: these
correspond to the protein-gene database of Escherichia coli
K-12 developed by Neidhardt and colleagues (14, 23), the rat
REF52 database established by Garrels and co-workers at
Cold Spring Harbor (18, 22), and a few human databases
(transformed amnion cells [15, 20], normal embryonal lung
MRC-5 fibroblasts [17, 21], keratinocytes [19], and peripheral
blood mononuclear cells [15]) developed in Aarhus. Given
space limitations and to keep this review in focus, we will
concentrate on the computerized analysis of human cellular
2-dimensional gel patterns, and in particular on the steps
involved in establishing comprehensive 2-dimensional gel
databases that can link protein and DNA information.



MAKING AND MANAGING A COMPREHENSIVE
2-DIMENSIONAL GEL DATABASE OF HUMAN
CELLULAR PROTEINS

The first step in making a comprehensive 2-dimensional gel
protein database is to prepare a synthetic image (digital form
of the gel image) of the gel (fluorogram, Coomassie blue or
silver stained gel) to be used as a standard or master
reference. This can be done with laser scanners, charge couple
device (CCD)² array scanners, television cameras, rotating drum
scanners, and multiwire chambers (13). Computerized analysis
systems for spot detection, quantitation, pattern matching,
and data handling (access and retrieval of information,
database making) have been described in the literature
(ELSIE [43], GELLAB [11], HERMeS [44], MELANIE [10],
QUEST [9], and TYCHO [8]) and some are available
commercially (PDQUEST, Protein Databases Inc., Huntington,
N.Y.; KEPLER, Large Scale Biology, Rockville, Md.;
Visage, BioImage Corporation, Ann Arbor, Mich.; Gemini,
Joyce Loebl, Gateshead; Microscan 1000, Technology
Resources Inc., Nashville, Tenn.; and MasterScan, Billerica,
Mass.). Unfortunately, most of these systems are incompatible
with one another; their advantages and disadvantages
have been discussed by Miller (13).

In our work station in Aarhus, fluorograms are scanned
with a Molecular Dynamics laser scanner and the data are
analyzed using the PDQUEST II software (Protein Databases
Inc.) (12) running on a SPARCstation computer 4100
FC-8-P3 from SUN Microsystems, Inc. The scanner measures
intensity in the range of 0-2.0 absorbance. A typical
scan of a 17 x 17 cm fluorogram takes about 2 min. Steps
in image analysis include: initial smoothing, background
subtraction, final smoothing, spot detection, and fitting of
ideal Gaussian distributions to spot centers. Spot intensity
is calculated as the integration of a fitted Gaussian. If
calibration strips containing individual segments of a known
amount of radioactivity are used, it is possible to merge
multiple exposures of the sample image into a single data
image of greater dynamic range. Once the synthetic image is
created it can be stored on disk and displayed directly on
the monitor. Functions that can be used to edit the images
include: cancel (for example, to erase scratches that may
have been interpreted as spots by the computer, or to cancel
streaks or low dpm spots), combine (sometimes a spot may be
resolved into several closely packed spots), restore,
uncombine, and add spot to the gel. The process is time
consuming (about 1.5 days per image). Edited standard images
can be matched to other synthetic images. Figure 2A shows a
portion of a standard synthetic image (IEF) of a fluorogram of
[35S]methionine-labeled cellular proteins from human AMA
cells (master database) (20). Images can be displayed either
in black and white (resembling the original fluorograms) or
in color (other images in Fig. 2), depending on the need. As
shown in Fig. 2B, each polypeptide is assigned a number by
the computer, which facilitates the entry and retrieval of
qualitative and quantitative information for any given spot
in the gel (20). The standard image can be matched
automatically by the computer to other standard or reference
gels (Fig. 2C, matching of AMA cellular proteins [left] to
MRC-5 proteins [right]) provided a few landmark spots are
given manually as reference (indicated in Fig. 2C) to
initiate the process.
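
The statement that spot intensity is the integration of a fitted Gaussian can be illustrated with a short sketch. It assumes an axis-aligned 2-dimensional Gaussian with peak height `peak` and widths `sigma_x` and `sigma_y` (these parameter names are ours, and this is not the PDQUEST implementation); for that shape the integrated volume has the closed form 2π · peak · sigma_x · sigma_y, which the second function checks numerically.

```python
import math


def gaussian_spot_volume(peak, sigma_x, sigma_y):
    """Closed-form integral of peak * exp(-x^2/(2 sx^2) - y^2/(2 sy^2))."""
    return 2.0 * math.pi * peak * sigma_x * sigma_y


def numeric_spot_volume(peak, sigma_x, sigma_y, step=0.1, extent=6.0):
    """Riemann-sum check of the same integral over +/- `extent` sigmas."""
    nx = int(extent * sigma_x / step)
    ny = int(extent * sigma_y / step)
    total = 0.0
    for i in range(-nx, nx + 1):
        x = i * step
        for j in range(-ny, ny + 1):
            y = j * step
            total += peak * math.exp(
                -0.5 * ((x / sigma_x) ** 2 + (y / sigma_y) ** 2)
            )
    return total * step * step


if __name__ == "__main__":
    print(gaussian_spot_volume(1.5, 2.0, 3.0))
    print(numeric_spot_volume(1.5, 2.0, 3.0))
```

Because the volume depends on the fitted peak and widths rather than on a pixel count, two overlapping spots can each be quantitated from their own fitted parameters.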



²Abbreviations: CCD, charge couple device; PCNA, proliferating
cell nuclear antigen; HPLC, high performance liquid
chromatography.



Figure 2. A) Synthetic image of a fraction of an IEF gel of the
master image of AMA cellular proteins. B) As in A but showing
numbers. C) Matched comparison of AMA (left) and normal human
embryonal lung MRC-5 fibroblast (right) IEF protein patterns
[...]. D) Synthetic image of a fraction of an IEF fluorogram of
[35S]methionine-labeled proteins from [...] MRC-5 fibroblasts.
The histograms show levels of synthesis of a few proteins in
normal (left bar) and SV40 transformed (right bar) MRC-5
fibroblasts. E) Polypeptides that contain information under the
category glycolytic pathway. F) [...] information available for
a given protein. G) [...] of cytoskeletal-related proteins in
quiescent, proliferating, and SV40-transformed MRC-5
fibroblasts. H) Polypeptides that contain information under the
category partial amino acid sequences.

The automatic matching process, which has been described
in detail by Garrels et al. (12), takes about 5 min. Matched
proteins are indicated with the same letters in both gels
(Fig. 2C). The usefulness of this function is emphasized by
the fact that data accumulated on common "household" proteins
can be easily transferred to any other human cell type
whose 2-dimensional gel cellular protein pattern is matched
to our standard AMA 2-dimensional gel protein image.
Alternatively, if the standard gel is part of a matchset (set
of gels in a given experiment) it can be used as a linker gel
to compare, for example, the quantitative values of a given
protein throughout the experiment (see Fig. 2D; levels of some
proteins in normal and SV40 transformed human MRC-5
fibroblasts) or with other standard images in different sets
of cross-matched experiments (18, 22).
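
The idea of seeding the match with a few manually chosen landmark spots can be sketched as follows. This is a deliberately simplified illustration and not the algorithm of Garrels et al. (12): it assumes the two gels differ only by a uniform shift, estimates that shift from the landmark pairs, and then pairs each remaining spot with the nearest unclaimed spot within a tolerance. All names and coordinates are hypothetical; real gels show local distortions that a global shift cannot capture.

```python
import math


def mean_offset(landmarks):
    """landmarks: list of ((x_std, y_std), (x_other, y_other)) pairs."""
    n = len(landmarks)
    dx = sum(other[0] - std[0] for std, other in landmarks) / n
    dy = sum(other[1] - std[1] for std, other in landmarks) / n
    return dx, dy


def match_spots(standard, other, landmarks, tolerance=2.0):
    """Map spot numbers of `standard` to spot numbers of `other`.

    `standard` and `other` are dicts: spot number -> (x, y).
    Spots with no partner within `tolerance` stay unmatched.
    """
    dx, dy = mean_offset(landmarks)
    matches = {}
    taken = set()
    for number, (x, y) in sorted(standard.items()):
        best, best_dist = None, tolerance
        for candidate, (cx, cy) in other.items():
            if candidate in taken:
                continue
            dist = math.hypot(cx - (x + dx), cy - (y + dy))
            if dist <= best_dist:
                best, best_dist = candidate, dist
        if best is not None:
            matches[number] = best
            taken.add(best)
    return matches


if __name__ == "__main__":
    standard = {1: (10, 10), 2: (20, 15), 3: (40, 40)}
    other = {101: (13, 9), 102: (23, 14), 103: (80, 80)}
    print(match_spots(standard, other, [((10, 10), (13, 9))]))
```

Once spot numbers are paired in this way, annotations stored under a number in the standard image can be looked up for the matched spot in the other gel.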

Once a standard map of a given protein sample is made,
one can enter qualitative annotations to make a reference
database. Our master 2-dimensional gel database of
transformed human amnion cell (AMA) proteins (20) lists 3430
polypeptides, of which 2592 correspond to cellular
components, having pIs ranging from 4 to 13 and molecular
weights between 8.5 and 230 kDa. The most abundant protein
in the database corresponds to total actin (3.87% of total
protein; about 90 million molecules per cell), while the
least abundant of the recorded polypeptides are present in
the vicinity of 5000 molecules per cell. Some annotation
categories we are using to establish the master AMA database
include: 1) protein identification (comigration with
purified proteins, 2-dimensional immunoblotting,
microsequencing); 2) amounts (total amounts and levels of
synthesis); 3) subcellular localization (nuclear,
cytoskeletal, membrane, membrane receptors, specific
organelles, etc.); 4) antibodies; 5) posttranslational
modifications (phosphorylation, glycosylation, methylation,
etc.); 6) microsequencing; 7) cell cycle specificity (specific
variations in levels of synthesis and amount); 8) regulatory
behavior (effect of hormones, growth factors, heat shock,
etc.); 9) rate of synthesis in normal and transformed cells
(proliferation sensitive proteins, cell cycle specific
proteins, oncogenes, components of the pathway [or pathways]
that control cell proliferation); 10) function (mainly from
comigration with proteins of known function); 11) sets of
proteins that are coordinately regulated (hierarchy of
controls, differential gene expression in various cells,
etc.); 12) cDNAs (cloned cDNAs); 13) proteins that are
specific to a given disease (systematic comparison of protein
patterns of fibroblast proteins from healthy and diseased
individuals); 14) exploitation of transfected cDNAs;
15) pathways (metabolic, others); 16) gene localization
(genetic and physical); 17) effect of microinjected antibody
on patterns of protein synthesis; and 18) secreted proteins.
Information entered for any spot in a given annotation
category can be easily retrieved by asking the computer to
display the information on the color screen. For example,
Fig. 2E shows a synthetic image of a NEPHGE gel (master
AMA database) displaying the information contained under
the entry glycolytic pathway. Alternatively, one can use the
function "peruse annotations for spot" to directly ask the
computer to list all the entries available for a particular
protein. By clicking the mouse on a given entry (in this case,
presence in fetal human tissues) it is possible to take a
quick look at the information in that particular entry
(Fig. 2F).
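
The two retrieval routes just described (all spots carrying an entry in one category, or all entries for one spot) amount to indexing the same annotations in two directions. The sketch below is a minimal data-structure illustration under that assumption; the class and method names are hypothetical and do not reproduce the PDQUEST interface. The lipocortin V values come from Table 1; the spot label "NEPHGE 0000" is invented for the example.

```python
from collections import defaultdict


class GelDatabase:
    """Toy annotation store keyed by spot number and by category."""

    def __init__(self):
        self._by_spot = defaultdict(dict)     # spot -> {category: text}
        self._by_category = defaultdict(set)  # category -> {spot, ...}

    def annotate(self, spot, category, text):
        """Record `text` under `category` for the given spot."""
        self._by_spot[spot][category] = text
        self._by_category[category].add(spot)

    def spots_in_category(self, category):
        """All spots that carry an entry in `category` (for display)."""
        return sorted(self._by_category.get(category, ()))

    def peruse(self, spot):
        """All entries recorded for one spot."""
        return dict(self._by_spot.get(spot, {}))


if __name__ == "__main__":
    db = GelDatabase()
    db.annotate("IEF 8216", "protein identification", "lipocortin V")
    db.annotate("IEF 8216", "isoelectric point", "4.76")
    db.annotate("IEF 8216", "apparent molecular weight", "33.3 kDa")
    db.annotate("NEPHGE 0000", "pathways", "glycolytic pathway")  # invented spot
    print(db.spots_in_category("pathways"))
    print(db.peruse("IEF 8216"))
```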

A major obstacle encountered in building comprehensive
2-dimensional gel protein databases is identifying the large
number of proteins separated by this technology. In our
databases (20, 21), known proteins are identified by one or
a combination of the following procedures: 1) comigration
with known proteins, 2) 2-dimensional gel immunoblotting
using specific antibodies, and 3) microsequencing of
Coomassie Brilliant Blue stained human proteins recovered
from dried 2-dimensional gels (see next section). Protein
identification by means of microsequencing may be difficult,
as individual protein members of families with short peptide
differences escape detection. In the gene-protein database
of E. coli K-12 (14, 23), another major 2-dimensional gel
database available at present, proteins are being identified
by a wider range of tests that include comigration with
purified proteins; genetic criterion (deletion, insertion,
frameshift, nonsense, missense, regulatory); plasmid-bearing
strains and in vitro synthesis of protein; selective labeling
(methylation, phosphorylation); peptide map similarity; and
physiological criterion and selective derivatization.



So far we have received nearly 550 antibodies from [...]
tested by 2-dimensional gel immunoblotting for antigen
determination. Similarly, purified proteins [...] from many
laboratories have greatly aided identification of unknown
proteins (20, 21). We routinely request [...] available all
the information we may have accumulated on that particular
protein. For example, Table 1 lists entries available for
lipocortin V (IEF SSP 8216), also known as annexin V, VAC-α,
endonexin II, renocortin, chromobindin-5, anticoagulant
protein, PAP-I, γ-calcimedin, IBC, calphobindin, and
anchorin CII.

As mentioned previously, one distinct advantage of
2-dimensional gel electrophoresis is the possibility of
studying quantitative variations in cellular protein patterns
that may lead to identification of groups of proteins that are
expressed coordinately during a given biological process.
Quantitation, however, is not an easy task, as reflected by
the lack of published data on global cellular protein
patterns. We believe this is partly due to difficulties in
obtaining sets of gels that are suitable for computer analysis
(streaking, material remaining at the origin, etc.) as well as
to limitations (laborious editing time, need of calibration
strips to merge images, limited dynamic range, etc.) in the
computer analysis systems available at the moment. Perhaps
the most advanced quantitative studies published so far using
computer analysis have been carried out by Garrels and
co-workers (18, 22). In particular, these investigators have
established a quantitative rat protein database (18, 22)
designed to study growth control (proliferation, growth
inhibitors, and stimulation) and transformation in
well-defined groups of [...] transformation of rat REF52
cells with SV40, adenovirus, and the Kirsten murine sarcoma
virus. These studies have revealed clusters of proteins
induced or repressed during growth to confluence as well as
groups of transformation-sensitive proteins that respond in a
differential fashion to transformation by DNA and RNA
viruses. A most interesting feature of this quantitative
database is the discovery of a group of coregulated proteins
that show similar expression patterns as the cell
cycle-regulated DNA replication protein PCNA/cyclin.

In our human databases, most quantitations have been
carried out by estimating the radioactivity contained in the
polypeptides by direct counting of the gel pieces in a
scintillation counter (20, 21). Up to 700 proteins can be cut
out through appropriately exposed films in a period of time
comparable to that required for editing a synthetic image.
Manual quantitation of this large number of spots is difficult
without the assistance of a master reference image and a
numbering system that can be used to identify the spots.
Using this approach, we have recorded quantitative changes in
the relative abundance of 592 [35S]methionine-labeled
proteins synthesized by quiescent, proliferating, and SV40
transformed human embryonic lung MRC-5 fibroblasts (21).
Some data concerning cytoskeletal and cytoskeletal-related
proteins are presented in Fig. 2G. Our studies as well as
those of Garrels and co-workers (18, 22) may in the long run
help define patterns of gene expression that are
characteristic of the transformed state.

OTHER 2-DIMENSIONAL GEL PROTEIN
DATABASES

As mentioned previously there are other 2-dimensional gel 
databases available in computer form that have been pub- 

TABLE 1. Some entries for lipocortin V

1. [...]: PAP-I, VAC-α, 35-γ calcimedin, IBC, calphobindin I,
   anchorin CII, annexin V
2. Percentage of total protein: 0.[...]% (about 2,800,000
   molecules per cell)
3. Apparent molecular weight (mr): 33.3 kDa
4. Isoelectric point (pI): 4.76
5. Method (or methods) of identification: Microsequencing,
   2-dimensional immunoblotting, comigration
6. Credit to investigators that aided in identification: [...]
7. Antibody against protein: Polyclonal (rabbit, antibody
   no. 20), B. Pepinsky, BIOGEN, Cambridge
8. Comigration with human proteins: Lipocortin V, N. G. Ahn,
   Howard Hughes Medical Institute, Washington University
9. Cellular localization: Subcortical membrane
10. Calcium/phospholipid-dependent membrane proteins:
    Lipocortin V
11. Function: [...] inflammation, immune response, blood
    coagulation [...]
12. Partial amino acid sequence: GTVTDFPGFDER (7-18),
    VLTEIIASR (109-117), QVYEEEYGSSLEDDVVG (127-143),
    ?GTDEEKFITIFGT(R) (187-201)
13. cDNA sequence: Known. R. Blake et al., J. Biol. Chem. 263,
    10799-10811, 1988 (pI = 4.76 from translated sequence)
14. Levels in fetal human tissues:
    adrenal glands = +++; brain = +++; cerebellum = +++;
    ear = +++; eye = +++; heart = +++; hypophysis = +++;
    liver = +; lung = +++; meninges = +++;
    mesonephric tissue = +++; striated muscle = +++;
    pancreas = +++; skin = +++; spleen = [...];
    stomach = +++; submandibular gland = +++;
    small intestine = +++; thymus = +++; thyroid gland = ++;
    tongue = ++; ureter = +++
15. Levels in quiescent, proliferating, and transformed MRC-5
    fibroblasts: Q (quiescent) = 1.1; P (proliferating) = 1.0;
    T (SV40 transformed) = 0.3
16. Distribution in Triton supernatant and cytoskeletons:
    Mainly supernatant



lished in extenso: these correspond to the E. coli K-12
protein-gene database (14, 23) and to the rat REF52 database
(18, 22).

The E. coli K-12 cellular protein-gene database is perhaps
the most complete of all databases reported so far, and
eventually it should trace each protein back to its structural
gene. Information contained in this database includes:
gene/protein name (protein name, EC number, gene name);
2-dimensional gel spot designations (x-y coordinates from
reference gels, alphanumeric designation); genetic
information (linkage map location, physical map location,
GenBank code, sequence reference, location on Kohara clones);
biochemical information (molecular weight, pI, number of
residues of each amino acid, mole percent of each amino
acid, total number of amino acids in a polypeptide); and
regulatory information (cellular level of protein in different
media and at different temperatures, member of regulon,
member of stimulon). Major advances of this database are
envisaged in the future in view of the imminent sequencing of
the whole E. coli genome as well as the development of
improved methods to express cloned genes.

The rat REF52 2-dimensional gel protein database lists
about 1600 proteins that have been recorded using the
QUEST analysis system (18, 22). Included in this quantitative
database are 1) protein names (cytoskeletal and heat
shock proteins as well as various nuclear, mitochondrial, and
cytoplasmic proteins), 2) annotations (subcellular
localization, modification, recognition by specific
antibodies, coprecipitation, NH2-terminal sequence,
cross-reference to protein sequence information, and
references to the literature), 3) protein sets (cytoskeletal
proteins, phosphoproteins, sets of proteins with
PCNA/cyclin-like properties, etc.), and 4) general
quantitative data (protein synthesis during growth of normal
REF52 cells to confluence and quiescence, and after
restimulation of growth-inhibited cells).

In addition to the 2-dimensional gel databases mentioned
so far there are several smaller cellular databases being
established in human (normal human diploid fibroblasts,
lymphocytes, leukocytes, leukemic cells), mouse (NIH/3T3
cells, T lymphocytes), Aplysia, yeast (Saccharomyces
cerevisiae), plants (wheat, barley, sorghum), and Euglena.
Databases of tissue proteins (brain, whole mouse, liver) and
body fluid proteins (plasma proteins, cerebrospinal fluid,
urine, and milk) are being established in several
laboratories. The reader is directed to the review by Celis
et al. (4) for details and references concerning these
databases.



MICROSEQUENCING HAS ADDED A NEW
DIMENSION TO COMPREHENSIVE
2-DIMENSIONAL GEL DATABASES: A DIRECT
LINK BETWEEN PROTEINS AND GENES

The development of highly sensitive amino acid gas-phase or
liquid-phase sequenators (24), together with the
establishment of efficient protein and peptide sample
preparation methods, has opened the possibility to perform a
systematic sequence analysis of proteins resolved by
2-dimensional gel electrophoresis. Indeed, generated pieces of
protein sequences can be used to search for protein identity
(comparison with available sequences stored in databanks) as
well as for preparing specific DNA probes for cloning of as
yet uncharacterized proteins (Fig. 1). In addition, partial
protein sequences can be stored in 2-dimensional gel
databases (for example, see Fig. 2H) and offer a unique link
between proteins and genes (Fig. 1).
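
Searching a partial sequence against a known protein, and reporting the residue range it covers (as in entries of the form "GTVTDFPGFDER (7-18)" in Table 1), can be sketched in a few lines. This is only an illustrative exact-match scan, not a databank homology search; the function name is hypothetical, "?" is treated as an unidentified residue that matches anything, and the protein string is an illustrative fragment assumed for the example rather than a databank entry.

```python
def locate_peptide(peptide, protein):
    """Return 1-based (start, end) ranges where `peptide` fits `protein`.

    A '?' in the peptide stands for an unidentified residue and
    matches any amino acid.
    """
    hits = []
    for start in range(len(protein) - len(peptide) + 1):
        window = protein[start:start + len(peptide)]
        if all(p == "?" or p == r for p, r in zip(peptide, window)):
            hits.append((start + 1, start + len(peptide)))
    return hits


if __name__ == "__main__":
    fragment = "MAQVLRGTVTDFPGFDERADAETLRK"  # illustrative fragment (assumption)
    print(locate_peptide("GTVTDFPGFDER", fragment))
    print(locate_peptide("?TVTD", fragment))
```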

In the early 1970s gel electrophoresis was used to purify
proteins for sequencing purposes (reviewed by Weber and
Osborn in ref 25). Proteins were recovered by diffusion and
sequenced by the manual dansyl-Edman degradation at the
nanomole level. This technique was further refined by using
electro-elution to recover proteins and by miniaturizing the
system (26). This method has been used extensively but
showed increasing drawbacks (low yields, protein samples
contaminated by free amino acids, and NH2-terminal blocking)
as the amounts of handled protein gradually became
smaller (e.g., at the 10 picomole level).

Most of the problems referred to above have been
minimized with the introduction of protein-electroblotting
procedures (27-32). When proteins are blotted on chemically
inert membranes, it is possible to sequence the immobilized
proteins directly without additional manipulations.
Thus, depending on the amount of bound protein and its
nature, this direct sequencing procedure generally yields
NH2-terminal sequences containing 10-40 residues. As such,
this technique was used to identify, by their NH2-terminal
sequences, differentially expressed major proteins from total
cellular extracts separated on 2-dimensional gels. A major
difficulty encountered in this procedure is the occurrence of
frequent artifactual blockage of the proteins. Several studies
suggest that this phenomenon is mainly due to reaction with
contaminants (particularly unpolymerized acrylamide
present in the gel) and to a high dilution of the protein (low
concentration of the protein per unit membrane surface). In
addition to this primarily technical problem, many proteins
are blocked in vivo by acylation or by a pyrrolidone
carboxylic acid cap.

The problem of partial or complete NH2-terminal blockage
can be circumvented by generating internal amino acid
sequences. This is achieved by fragmenting the protein
present in the gel (gel in situ cleavage) or by cleaving it
while bound to the membrane (membrane in situ cleavage)
(33-35). In both cases, proteins are either cleaved in a
restricted way (e.g., by limited enzymatic digestion or by
using restrictive chemical cleavage conditions) or fragmented
into smaller peptides.



Of the different combinations examined, we had good
results by using exhaustive proteolytic digestion of
membrane-immobilized proteins. This method has been
described for Ponceau red-stained proteins on nitrocellulose
blots (34), for Amido black-stained Immobilon-bound
proteins, and for fluorescamine-detected proteins on glass
fiber membranes (35). The proteases used (trypsin,
chymotrypsin, or pepsin) cleave at multiple sites, generating
small peptides that elute from the blot into the digestion
buffer, from which they are purified by reversed-phase high
performance liquid chromatography (HPLC) before being
sequenced individually. Although each of these manipulations
could be expected to result in a reduced yield of final
sequence information, we were surprised that the peptides
could be sequenced with high efficiency. In our hands, this
approach could be routinely applied to gel-purified proteins
available in amounts ranging from 5 to 10 μg, and often
yielded sequence information covering more than 30% of the
total protein. As membrane-immobilized proteins are not
homogeneously digested, but rather show protease-sensitive
regions next to resistant regions, the number of peptides
generated is much lower than expected from the number of
potential cleavage sites. Consequently, HPLC peptide
chromatograms are less complex and most peptides can be
recovered in pure form.
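
The "number of potential cleavage sites" referred to above can be estimated with an in silico digest. The sketch below uses a commonly quoted rule of thumb for trypsin (cleavage after Lys or Arg unless the next residue is Pro); that rule, the function name, and the example fragment are assumptions made for illustration and are not taken from the procedures in this review. As the text notes, digestion of membrane-bound protein yields fewer peptides than such a complete theoretical digest predicts.

```python
def tryptic_peptides(protein, min_length=1):
    """Return (start, end, peptide) tuples for a complete theoretical digest.

    Cleaves after K or R unless the following residue is P.
    Positions are 1-based and inclusive.
    """
    peptides = []
    start = 0
    for i, residue in enumerate(protein):
        is_last = i == len(protein) - 1
        if is_last or (residue in "KR" and protein[i + 1] != "P"):
            piece = protein[start:i + 1]
            if len(piece) >= min_length:
                peptides.append((start + 1, i + 1, piece))
            start = i + 1
    return peptides


if __name__ == "__main__":
    fragment = "MAQVLRGTVTDFPGFDERADAETLRK"  # illustrative fragment (assumption)
    for start, end, peptide in tryptic_peptides(fragment):
        print(start, end, peptide)
```

Raising `min_length` drops very short fragments, which are unlikely to be useful for sequence analysis.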

As only limited amounts of a protein mixture can be
loaded on a 2-dimensional gel, proteins of interest are often
obtained in yields insufficient for the currently available
sequencing technology. More material can be obtained by
enriching for a certain subcellular fraction (purified cell
organelles) or by exploiting affinity (dyes, metals, drugs,
etc.) or hydrophobic properties of proteins before gel
analysis. All of the sequencing results accumulated so far in
the human protein database (20) (a few are shown in Fig. 2H)
have been obtained from analysis of protein spots collected
from 2-dimensional gels that had been stained with Coomassie
blue according to standard procedures and dried for storage.
Proteins are recovered from the collected gel pieces by a
protein-elution-concentration device, combined with gel
electrophoresis and electroblotting. Details of this technique
have been reported in a previous communication (42) and a
brief outline is given below.

Combined gel pieces are allowed to swell in gel sample
buffer (a total volume of 1.5 ml). The gel pieces combined
with the supernatant are then collected into a large slot made
in a new gel. The slot is further filled with Sephadex G-10
equilibrated in gel sample buffer. During consecutive gel
electrophoresis, most of the electrical current passes on the
side of the slot instead of passing through the slot. This
results in both a vertical stacking and horizontal contraction
of the protein band. With this device the protein is
efficiently eluted from the gel pieces and concentrated from a
large volume into a narrow spot. The highly concentrated
(about 5 mm²) protein spot is then electroblotted on PVDF
membranes, stained with Amido black, and digested in situ
with trypsin. The peptides generated during digestion elute
from the membrane into the supernatant, and can be separated
by narrow bore reversed-phase HPLC and collected
individually for sequence analysis.

Using this and previous procedures (37, 39, 42), we have
so far analyzed 70 protein spots collected from
2-dimensional gels (20, and unpublished observations) (see
for example Fig. 2H). The sequence information amounts to
2100 allocated residues, corresponding to an average of 30
residues per protein spot. So far we have made cDNAs of
many of the unknown proteins that have been microsequenced,
and a substantial number have been cloned and sequenced.
All available information indicates that it may be
possible to obtain partial sequence information from most of
the proteins that can be visualized by Coomassie Brilliant
Blue staining.

Partial protein sequences are stored in the database as
displayed in Fig. 2H, and it should be possible in the near
future to interface this information with forthcoming DNA
sequence data from the human genome project. In the long
run, as the human genome sequences become available, it
will be possible to assign partial protein sequences to genes
for which the full DNA sequence and chromosomal location
are known (Fig. 1).



SUMMARY 

The studies presented in this brief review are intended to
demonstrate the usefulness of computer-aided 2-dimensional
gel electrophoresis and microsequencing to analyze cellular
protein patterns, and to link protein and DNA information.
As more information is gathered worldwide, comprehensive
databases will depict an integrated picture of the expression
levels and properties of the thousands of proteins that
orchestrate most cellular functions.

Clearly, databases allow easy access to a large body of data
and provide an efficient medium to communicate standardized
protein information. In the future, databases will
foster a wide variety of biological information that can be
used to support collaborative research projects in basic and
applied biology as well as in clinical research (2, 5, 46).
Once a protein is identified in a particular database, all the
information gathered on it can be made available to the
scientist. However, many problems must be solved before
protein databases become of general use to the scientific
community. A most urgent one is to promote standardization of
the gel running conditions so that data produced in a given
laboratory may be used worldwide. Surprisingly, the gel
running technology as it stands today is still a
craftsmanship art.

Finally, comprehensive, computerized databases of proteins,
together with recently developed techniques to
microsequence proteins, offer a new dimension to the study
of genome organization and function (Fig. 1). In particular,
human protein databases may become increasingly important
in view of the concerted effort to map and sequence the
entire human genome. This formidable task is expected to
dominate biological research in the next decades.